Contextual tokenizer #1415

Jemoka · 2024-09-02T05:16:06Z

Create a second tokenization stage that uses contextual word embeddings as well as token embeddings to be more accurate. It doesn't seem to help for languages with clearly delineated orthographies, but in Thai it helps a lot on dev and a smidge on test.

On UD_Thai-TUD:

New, test:

   Tokens Sentences     Words
    90.02     15.02     90.02

Old, test:

   Tokens Sentences     Words
    90.22     14.59     90.22

New, dev:

   Tokens Sentences     Words
    90.57     19.14     90.57

Old, dev:

   Tokens Sentences     Words
    90.08     15.67     90.08
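
To make the comparison easier to read, here is a small sketch that computes new-minus-old differences from the scores above. It assumes the second "Old" table is the dev-set baseline (it directly follows the "New, dev" table and its numbers differ from the first "Old" table); the `SCORES` layout and the `deltas` helper are made up for illustration and are not part of the PR.

```python
# Scores copied from the tables above (UD_Thai-TUD).
# Assumption: the second "Old" table is the dev-set baseline.
SCORES = {
    "test": {
        "new": {"Tokens": 90.02, "Sentences": 15.02, "Words": 90.02},
        "old": {"Tokens": 90.22, "Sentences": 14.59, "Words": 90.22},
    },
    "dev": {
        "new": {"Tokens": 90.57, "Sentences": 19.14, "Words": 90.57},
        "old": {"Tokens": 90.08, "Sentences": 15.67, "Words": 90.08},
    },
}


def deltas(split):
    """Return new-minus-old score differences for one split, rounded to 2 places."""
    new, old = SCORES[split]["new"], SCORES[split]["old"]
    return {metric: round(new[metric] - old[metric], 2) for metric in new}


for split in SCORES:
    print(split, deltas(split))
```

Under that reading, dev improves by 0.49 on tokens and 3.47 on sentences, while test improves by 0.43 on sentences and drops by 0.20 on tokens.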

Jemoka and others added 20 commits September 2, 2024 00:31

initial work on contextual-aware tokenizer

271dcd9

last mile tokenizer changes for running (still slow)

7e9388d

small optimization changes

58e90b8

remove extra spaces

1b9a9ce

Merge branch 'contextual_tokenizer' of github.com:stanfordnlp/stanza into contextual_tokenizer

eef83c3

use mwt info

23d4ab5

make the tokenizer a smidge more efficient

38a3398

now use an LSTM

e4b691a

Merge branch 'contextual_tokenizer' of github.com:stanfordnlp/stanza into contextual_tokenizer

393e655

various tokenizer changes for a smaller model

ca1423b

some edits to second pass classifier

a8934d5

bump second pass start steps

b687c6f

fixing some ordering problems

49ef2e6

split on even using draft positions

58345ba

initially, an identity transform

337405c

Merge branch 'contextual_tokenizer' of github.com:stanfordnlp/stanza into contextual_tokenizer

e46c639

whoops, it was applied in the other direction

e7ae33a

add some dropout

d6e4276

include character information in the tokenizer model

7545f39

add a tiny bit of dropout

6fe35de

Jemoka changed the title ~~[wip] Contextual tokenizer~~ Contextual tokenizer Sep 12, 2024

Jemoka requested a review from AngledLuffa September 12, 2024 23:33

Jemoka marked this pull request as ready for review September 12, 2024 23:33
